GitHub: https://github.com/pavanchavda/PC-ANLY506/tree/master/Code
The purpose of this research project is to analyze the relationship between income, life expectancy, and population for countries around the world using the gapminder dataset. I will use the exploratory data analysis techniques learned in the ANLY 506 course to explore the data, clean it, and create visualizations that help answer the research questions posed below. For the visualizations, I will use a combination of boxplots, interactive scatterplots, and treemaps to explore the data visually.
The focus of this research project will be on these questions:
The data used in this study was compiled by the Gapminder Foundation. The Gapminder Foundation is a non-profit organization that promotes sustainable global development and achievement of the United Nations Millennium Development Goals through increased use and understanding of statistics and other information about social, economic, and environmental development at local, national, and global levels [Gapminder Wikipedia]. Our dataset contains a total of 41,284 records and has 6 variables: country, year, life, population, income (i.e., GDP per capita), and region. The dataset covers the years 1800 to 2015.
Before we dive into the results and start creating visualizations, let's first explore the data so that we can familiarize ourselves with its structure and ensure that it is accurate and of good quality. If the data is inaccurate, the results will be too.
We will run the str() function on our data to review its structure. The str() function reveals that the population variable is stored as a factor and the year variable as an integer. Let's convert population to numeric and year to a factor. I also noticed that some variable names start with a lowercase letter and others with an uppercase letter. For consistency, I updated all variable names to start with an uppercase letter.
str(data)
## 'data.frame': 41284 obs. of 6 variables:
## $ Country : Factor w/ 197 levels "Åland","Afghanistan",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Year : int 1800 1801 1802 1803 1804 1805 1806 1807 1808 1809 ...
## $ life : num 28.2 28.2 28.2 28.2 28.2 ...
## $ population: num 3280000 NA NA NA NA NA NA NA NA NA ...
## $ income : int 603 603 603 603 603 603 603 603 603 603 ...
## $ region : Factor w/ 6 levels "America","East Asia & Pacific",..: 5 5 5 5 5 5 5 5 5 5 ...
#Change the datatype for population and Year column
data$population <- as.numeric(as.character(data$population))
data$Year <- as.factor(data$Year)
#Update column names for variables
colnames(data)<- c("Country", "Year", "LifeExpectancy", "Population", "Income", "Region")
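The two-step conversion through as.character() matters when a column is stored as a factor. The following is a minimal sketch of why, using a hypothetical toy factor (the name `pop_factor` and its values are illustrative assumptions, not rows from the dataset):

```r
# Hypothetical toy factor; the values are illustrative only
pop_factor <- factor(c("3280000", "24834", "3280000"))

# as.numeric() applied directly to a factor returns the internal level codes
codes <- as.numeric(pop_factor)

# Converting to character first recovers the numbers that the labels spell out
values <- as.numeric(as.character(pop_factor))

print(codes)
print(values)
```

Calling as.numeric() directly on a factor silently produces small integer codes instead of the population figures, which is why the conversion goes through character first.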
The top and bottom of the data reveal no immediate concerns about the quality of the data.
#Check top few rows of the data
head(data)
## Country Year LifeExpectancy Population Income Region
## 1 Afghanistan 1800 28.21100 3280000 603 South Asia
## 2 Afghanistan 1801 28.20075 NA 603 South Asia
## 3 Afghanistan 1802 28.19051 NA 603 South Asia
## 4 Afghanistan 1803 28.18026 NA 603 South Asia
## 5 Afghanistan 1804 28.17001 NA 603 South Asia
## 6 Afghanistan 1805 28.15977 NA 603 South Asia
#Check bottom few rows of the data
tail(data)
## Country Year LifeExpectancy Population Income Region
## 41279 Åland 1992 80.83 24834 NA Europe & Central Asia
## 41280 Åland 1993 81.80 24950 NA Europe & Central Asia
## 41281 Åland 1994 80.63 25066 NA Europe & Central Asia
## 41282 Åland 1995 79.88 25183 NA Europe & Central Asia
## 41283 Åland 1996 80.00 25301 NA Europe & Central Asia
## 41284 Åland 1997 80.10 25419 NA Europe & Central Asia
The number of rows and columns matches what I had expected.
#Check number of rows in our data
nrow(data)
## [1] 41284
#Check number of columns in our data
ncol(data)
## [1] 6
There are 197 unique countries in our dataset. There are currently 195 countries in the world, but our dataset goes back to the year 1800, so it may include countries that no longer exist. The countries are divided into 6 regions: America, East Asia & Pacific, Europe & Central Asia, Middle East & North Africa, South Asia, and Sub-Saharan Africa.
#Counting the number of distinct countries in our data using length and unique
length(unique(data$Country))
## [1] 197
#Checking the frequency of regions in the data
table(data$Region)
##
## America East Asia & Pacific
## 7961 6256
## Europe & Central Asia Middle East & North Africa
## 10468 4309
## South Asia Sub-Saharan Africa
## 1728 10562
A summary of the data shows that the population variable contains 25,817 NAs and the income variable contains 2,341 NAs. A quick glance at the data shows that the NAs in the population variable arise because population is only available every 10 years until 1950. For the NAs in the income variable, let's examine further to see which countries are missing data.
#Summarize the data to obtain descriptive statistics
summary(data)
## Country Year LifeExpectancy
## Afghanistan : 216 1997 : 197 Min. : 1.00
## Albania : 216 1988 : 196 1st Qu.:31.00
## Algeria : 216 1989 : 196 Median :35.12
## Angola : 216 1990 : 196 Mean :42.88
## Antigua and Barbuda: 216 1991 : 196 3rd Qu.:55.60
## Argentina : 216 1992 : 196 Max. :84.10
## (Other) :39988 (Other):40107
## Population Income Region
## Min. :1.548e+03 Min. : 142 America : 7961
## 1st Qu.:5.335e+05 1st Qu.: 883 East Asia & Pacific : 6256
## Median :3.358e+06 Median : 1450 Europe & Central Asia :10468
## Mean :2.119e+07 Mean : 4571 Middle East & North Africa: 4309
## 3rd Qu.:1.078e+07 3rd Qu.: 3483 South Asia : 1728
## Max. :1.376e+09 Max. :182668 Sub-Saharan Africa :10562
## NA's :25817 NA's :2341
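The NA counts reported by summary() can also be pulled out directly, together with the years in which a value is actually present. This is a minimal sketch on a hypothetical toy data frame (`toy` and its values are assumptions made for illustration; on the real data the same calls would be applied to `data`):

```r
# Hypothetical toy data frame that mimics three of the dataset's columns
toy <- data.frame(
  Year = c(1800, 1801, 1810, 1811, 1950, 1951),
  Population = c(100, NA, 110, NA, 200, 205),
  Income = c(600, 600, NA, 610, 900, 910)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))

# Years for which Population is present, to see how the gaps are spaced
years_with_pop <- toy$Year[!is.na(toy$Population)]

print(na_counts)
print(years_with_pop)
```

Listing the years with non-missing population is one way to confirm the spacing of the available observations rather than relying on a visual scan of the rows.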
The table below shows, by country, the number of years for which income data is missing. Of the 15 countries with missing income data, only Croatia has a somewhat significant population. The rest are very small in terms of population and will not skew our analysis.
#Create table that shows the number of years we are missing the income data for
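#across() replaces the deprecated funs() helper (requires dplyr 1.0.0 or later)
kable(data %>%
  group_by(Country) %>%
  summarise(across(everything(), ~ sum(is.na(.x)))) %>%
  filter(Income > 0), format = "html", padding = 2, table.attr = "id=\"mytable\"")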
| Country | Year | LifeExpectancy | Population | Income | Region |
|---|---|---|---|---|---|
| Åland | 0 | 0 | 0 | 10 | 0 |
| Channel Islands | 0 | 0 | 0 | 55 | 0 |
| Croatia | 0 | 0 | 135 | 20 | 0 |
| French Guiana | 0 | 0 | 135 | 205 | 0 |
| French Polynesia | 0 | 0 | 135 | 205 | 0 |
| Guadeloupe | 0 | 0 | 135 | 205 | 0 |
| Guam | 0 | 0 | 135 | 205 | 0 |
| Martinique | 0 | 0 | 135 | 205 | 0 |
| Mayotte | 0 | 0 | 135 | 205 | 0 |
| Netherlands Antilles | 0 | 0 | 150 | 205 | 0 |
| New Caledonia | 0 | 0 | 135 | 205 | 0 |
| Reunion | 0 | 0 | 135 | 205 | 0 |
| Tokelau | 0 | 0 | 0 | 1 | 0 |
| Virgin Islands (U.S.) | 0 | 0 | 135 | 205 | 0 |
| Western Sahara | 0 | 0 | 135 | 205 | 0 |
So far we have cleaned up the data and used exploratory data analysis techniques to familiarize ourselves with it. Now the fun part begins: we can start exploring the data visually.
While there is a vast amount of data available to analyze, for the purpose of this study I will only analyze the data for the most recent year available, 2015. So let's first create a new dataset with only the 2015 data.
#Filter data for year 2015 and assign it to a new variable called data2015
data2015 <- data %>%
filter(Year==2015)
Figure 1 below shows that Europe & Central Asia has the highest median income, at around $25,000. On the other hand, the South Asia and Sub-Saharan Africa regions have the lowest GDP per capita. We can also see that the interquartile range (IQR) for Middle East & North Africa is much larger than for other regions. This is likely because of the large differences in GDP per capita among Middle Eastern and North African countries.
#Plot boxplot
ggplot(data2015,aes(Region,Income,fill=Region))+geom_boxplot()+
labs(title = "Figure 1: GDP per Capita by Region in 2015",x="Region",y="Income(GDP per Capita)")+
scale_y_continuous(labels = scales::comma, breaks=seq(0,150000,25000))+
theme(panel.background = element_blank(), panel.grid = element_blank(),legend.position = "none",axis.text.x=element_text(angle = 10,vjust = 0),axis.title.x = element_blank())
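The quantities read off the boxplot (the median and the interquartile range per region) can also be computed numerically. The following is a minimal base R sketch on a hypothetical toy data frame (`toy_income`, its region labels, and its values are assumptions; on the real data the same calls would use `data2015$Income` and `data2015$Region`):

```r
# Hypothetical toy data; region labels and incomes are illustrative only
toy_income <- data.frame(
  Region = c("A", "A", "A", "B", "B", "B"),
  Income = c(1000, 2000, 3000, 5000, 20000, 65000)
)

# Median income per region; na.rm = TRUE skips countries with missing income
region_median <- tapply(toy_income$Income, toy_income$Region, median, na.rm = TRUE)

# Interquartile range per region (third quartile minus first quartile)
region_iqr <- tapply(toy_income$Income, toy_income$Region, IQR, na.rm = TRUE)

print(region_median)
print(region_iqr)
```

Having the numbers alongside the figure makes it easier to state which region has the highest median and how much wider one region's spread is than another's.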
Figure 2 shows a treemap of income and population by country and region. The size of each rectangle represents the population of the country, while its color represents the income. Figure 2 shows that income in Europe & Central Asia and America is much higher than in other regions, while the Sub-Saharan Africa region has the lowest income. We can also see that approximately 25 countries in the East Asia & Pacific and South Asia regions represent about 50% of the global population, with India and China accounting for the majority of the population in those regions. Figure 2 also shows that countries with smaller populations usually have higher GDP per capita. This is not very surprising considering that GDP per capita takes the population of the country into account.
#Plot treemap
treemap(data2015,
index=c("Region", "Country"),
vSize="Population",
vColor="Income",
type="value",
format.legend = list(scientific = FALSE, big.mark = " "),
title = "Figure 2: Treemap of Population and Income by Country and Region",
overlap.labels = 0.5,
border.col = "black",
palette="RdBu")
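The population shares that the treemap conveys through rectangle area can be computed explicitly. This is a minimal sketch, assuming a hypothetical toy data frame (`toy_pop` and its values are illustrative; on the real data the inputs would be `data2015$Population` and `data2015$Region`):

```r
# Hypothetical toy data; region labels and populations are illustrative only
toy_pop <- data.frame(
  Region = c("East", "East", "South", "West"),
  Population = c(500, 100, 300, 100)
)

# Total population per region, ignoring missing values
region_pop <- tapply(toy_pop$Population, toy_pop$Region, sum, na.rm = TRUE)

# Share of the overall population held by each region
region_share <- region_pop / sum(region_pop)

print(round(region_share, 3))
```

Summing the shares of the regions of interest gives a figure that can be compared against the visual estimate from the treemap.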
Figure 3 below shows an interactive scatterplot of life expectancy and income by country and region. Figure 3 indicates that up to a life expectancy of about 70 years, income doesn't seem to be related to it. Above 70 years, however, the countries with higher GDP per capita have much higher life expectancy. This suggests that people living in higher-income countries have a better chance of living beyond the age of 70. The graph also shows that countries in the Sub-Saharan Africa region have the lowest life expectancy and are clustered close together. On the other hand, countries in the Europe & Central Asia and America regions have the highest life expectancy.
#Plot scatterplot
g1 <- ggplot(data2015,aes(LifeExpectancy,Income,group=Country,col=Region))+
geom_point()+
theme_classic()+
labs(title="Figure 3: Scatterplot of Life Expectancy and Income by Country and Region")+
theme(legend.title = element_blank(), panel.background = element_blank(), panel.grid = element_blank())+
scale_y_continuous(labels=scales::comma, breaks = c(25000,50000,75000,100000))
plotly::ggplotly(g1)
The plot of population against life expectancy did not indicate any strong correlation or any interesting results and is therefore not included.
In conclusion, the exploratory data analysis of the dataset revealed that the quality of the data is very good. A few countries are missing data for many years; however, they represent a very small proportion of the world population. The visualizations revealed that the Europe & Central Asia and America regions have the highest median GDP per capita, while the South Asia and Sub-Saharan Africa regions have the lowest GDP per capita in 2015. The data also showed some correlation between population and income, as countries with smaller populations usually have higher GDP per capita than countries with very large populations. There is also a strong correlation between income and life expectancy: countries with higher income generally have a life expectancy of 70 years or above.
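The visual impressions from the scatterplots could be backed by a correlation coefficient. The sketch below uses Spearman's rank correlation, which is my own choice here (not a method used in the original analysis) because income is heavily skewed. The data frame `toy_cor` and its values are hypothetical; on the real data the columns of `data2015` would be used instead:

```r
# Hypothetical toy data; values are illustrative, not taken from the dataset
toy_cor <- data.frame(
  LifeExpectancy = c(55, 60, 65, 72, 78, 82),
  Income = c(800, 1500, 3000, 12000, 35000, 60000),
  Population = c(5e6, 2e8, 1e6, 5e7, 3e5, 8e6)
)

# Rank correlation between life expectancy and income;
# use = "complete.obs" drops rows where either value is missing
rho_income <- cor(toy_cor$LifeExpectancy, toy_cor$Income,
                  method = "spearman", use = "complete.obs")

# Rank correlation between life expectancy and population
rho_pop <- cor(toy_cor$LifeExpectancy, toy_cor$Population,
               method = "spearman", use = "complete.obs")

print(rho_income)
print(rho_pop)
```

A coefficient near 1 or -1 would support a claim of strong association, while a value near 0 would be consistent with leaving the plot out.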